Introduction

Bicycle-sharing has exploded in popularity in the last decade. As of 2014, public bike sharing services were available in 50 countries and over 700 cities with close to a million bikes in circulation. Bicycle-sharing systems allow for users to borrow a bike from one location and return it to another dock in the city as long as both docks belong to the same system. One such system includes Citi Bike, which is a privately owned company sponsored by Citigroup, that operates in New York City and Jersey City. With approximately 130,000 subscribers, it has been proven to be one of the more popular systems. And with that, there is a ton of interesting data to consider.

The purpose of this file is to analyze Jersey City bike patterns between January and September of 2016. We will answer questions regarding who our main segment is, where bikers ride, how far they go, where we can find supply deficits, the effects of weather on riding behavior, and much more. We hope to find relevent insights that will help in improving and optimizing the system.

Understanding the Data

library(dplyr)
library(readr)

#Combining the Datasets
full_citi_bike_df <- list.files(full.names = TRUE) %>% lapply(read_csv) %>% bind_rows
full_citi_bike_df$Gender <- as.factor(full_citi_bike_df$Gender)

A quick exploration

It appears that greater than 60% of riders are male. The median trip duration for all riders is around 500 seconds or 8.33 minutes.

str(full_citi_bike_df)
summary(full_citi_bike_df)

barplot(prop.table(table(full_citi_bike_df$Gender)), main = "Gender Proportion", sub = "Figure 1.1", xlab = "Gender", ylab = "Relative Frequency", names.arg= c("NA", "Males", "Females"), col = "forestgreen")

barplot(prop.table(table(full_citi_bike_df$`Start Station Name`)), main = "Start Station Proportions", sub = "Figure 1.2",  ylab = "Relative Frequency", xlab = "Start Station Name", col = "blue")

barplot(prop.table(table(full_citi_bike_df$`End Station Name`)), main = "End Station Proportions", sub = "Figure 1.3", ylab = "Relative Frequency", xlab = "End Station Name", col = "blue" )

boxplot(full_citi_bike_df$`Trip Duration`, ylim = c(0,1500), main = "Trip Duration Summary Overall", sub = "Figure 1.4", ylab = "Trip Duration")

** Identifying some patterns in the data**

How does age affect ride duration and length of bike rides? Which age group uses our bikes the most?

As we can see from our first bar chart, it appears that the average trip duration is highest for the age bracket 46-60. However, since we took the mean, outliers may be skewing this data. As for the other age groups, trip duration seems to be pretty uniform across the board. If we look at figure 2.2 we can see that there are a fews riders in the lower age brackets that took longer trips in comparison to the rest. Still, we can confirm that the trip durations are relatively uniform across the other age brackets with the exception of a few young folks who have riden longer.

Figure 2.3 shows results for distance where age bracket 61-75 rode the furthest. In figure 2.4, it is interesting to see that younger individuals (age groups 16-30 and 31-45) travel further on their trips. This makes sense since they are are probably more fit and have the endurance to do so.

Finally, we can see in figure 2.5 that more than 50% of our observations fall in the 31-45 age group. This means that this is probably our target audience followed by individuals aged 16-30.

library(ggplot2)
library("geosphere")

#gives average trip duration for each of the ages
full_citi_bike_df$Age <- (2016 - full_citi_bike_df$`Birth Year`)
#aggregate(full_citi_bike_df$`Trip Duration`, list(full_citi_bike_df$Age), mean, na.rm = TRUE) 
#Creating age groups
library(scales)
full_citi_bike_df['Age.Group'] <- NA
full_citi_bike_df$Age.Group <- ifelse(full_citi_bike_df$Age>15 & full_citi_bike_df$Age<31, "16-30", ifelse(full_citi_bike_df$Age>30 & full_citi_bike_df$Age<46, "31-45", ifelse(full_citi_bike_df$Age>=46 & full_citi_bike_df$Age<=60, "46-60", ifelse(full_citi_bike_df$Age>=61 & full_citi_bike_df$Age<=75, "61-75", ifelse(full_citi_bike_df$Age>=76 & full_citi_bike_df$Age<=90, "76-90", ifelse(full_citi_bike_df$Age>=91 & full_citi_bike_df$Age<=105, "91-105", ifelse(full_citi_bike_df$Age>=106 & full_citi_bike_df$Age<=120, "106-120", NA)))))))

#gives average trip duration for each of the age groups
age.group.list <- aggregate(full_citi_bike_df$`Trip Duration`, list(full_citi_bike_df$Age.Group), mean, na.rm = TRUE) 
ggplot(age.group.list) + geom_bar(mapping = aes(x = Group.1, y = x), stat = 'identity', fill = "steelblue") + ggtitle("Age Group versus Trip Duration") + labs(x = "Age Groups", y = "Trip Duration") + labs(subtitle = "Figure 2.1")

minus_outlier = full_citi_bike_df[-176420,] #to remove extreme val 
ggplot(minus_outlier, aes(Age.Group, `Trip Duration`)) + geom_jitter(width = 0.25, aes(colour = Age.Group)) + ggtitle("Age Group versus Trip Duration") + labs(x = "Age Groups", y = "Trip Duration", subtitle = "Figure 2.2") + theme(legend.position="none") + scale_y_continuous(labels = comma)

#getting average trip distance (in meters) for each age group
full_citi_bike_df$Distance <- distHaversine(full_citi_bike_df[ ,c("Start Station Longitude", "Start Station Latitude")],full_citi_bike_df[,c("End Station Longitude", "End Station Latitude")])
age.group.list.distance <- aggregate(full_citi_bike_df$Distance, list(full_citi_bike_df$Age.Group), mean, na.rm = TRUE)
ggplot(age.group.list.distance) + geom_bar(mapping = aes(x = Group.1, y = x), stat = 'identity', fill = "steelblue") + ggtitle("Age Group versus Trip Distance") + labs(x = "Age Groups", y = "Trip Distance", subtitle = "Figure 2.3")

ggplot(full_citi_bike_df, aes(Age.Group, Distance)) + geom_jitter(width = 0.25, aes(colour = Age.Group)) + ggtitle("Age Group versus Distance") + labs(x = "Age Groups", y = "Distance", subtitle = "Figure 2.4") + theme(legend.position="none")

#which age group is our main audience?
barplot(prop.table(table(full_citi_bike_df$Age.Group)), main = "Age Group Frequencies", sub = "Figure 2.5", col = "navy")

Is there a difference in the number of rides between men and women of different ages?

Across all age groups, it appears that men use Citi Bikes more than women. In figure 2.8, we can see the difference to be most significant in the 31-45 range, where it seems to be that men are taking close to 3x the number of rides as women.

In terms of relative frequencies, figure 2.9 shows that there is a higher proportion of female customers who are in the 16-30 age group in comparison to male customers who are in the same age group. Further, we see that among male customers there is a larger proportion who are in the 46-60 age group than there are among female customers. This signifies that older males are more likely to be customers than younger men, and the same trend holds true for the 61-75 age group.

citi_bike_without_zero_gender <- subset(full_citi_bike_df, Gender!= 0) #removing rows where gender is unavailable
citi_bike_without_zero_gender["Genders"] <- NA
citi_bike_without_zero_gender$Genders <- ifelse(citi_bike_without_zero_gender$Gender==1, "Males", ifelse(citi_bike_without_zero_gender$Gender==2, "Females", NA))

rides_per_age_group_gender <- ggplot(data = citi_bike_without_zero_gender) + geom_bar(mapping = aes(x = citi_bike_without_zero_gender$Age.Group), fill = "steelblue") + facet_wrap(~citi_bike_without_zero_gender$Genders, ncol = 2) + xlab("Age Group") + ggtitle("Number of Rides per Age Group") + labs(subtitle = "Figure 2.8")
rides_per_age_group_gender

relative_frequencies_age_and_gender <- ggplot(citi_bike_without_zero_gender, aes(Age.Group, group = Gender)) + 
          geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count") + 
          scale_y_continuous(labels=scales::percent) +
          ylab("Relative Frequencies (Percentage)") +
          facet_grid(~Genders) + xlab("Age Group") + ggtitle("Percentage of Customers in Each Age Group by Gender") + theme(legend.position="none") + labs(subtitle = "Figure 2.9")
relative_frequencies_age_and_gender

Station Traffic by Gender

According to our heat maps below, it appears that there is barely any difference between where men and women are starting their bike rides or ending them.

library(ggmap)

#Getting map from Google Maps API
map2 <- get_map(location = c(lon = -74.04425, lat = 40.72760), zoom = 14, maptype = "roadmap", source = "google")


ggmap(map2, extent = "device") + stat_density2d(data = citi_bike_without_zero_gender, aes(x = `Start Station Longitude`, y = `Start Station Latitude`, fill = ..level.., alpha = ..level..), geom = "polygon", size = 0.01, bins = 16) + scale_fill_gradient(low = "red", high = "green") + facet_wrap(~Genders) + ggtitle ("Start Station Traffic by Gender") + theme(axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.title.y = element_blank(), axis.text.y = element_blank(), axis.ticks.y = element_blank(), plot.title = element_text(hjust = 0.5))

map3 <- get_map(location = c(lon = -74.04425, lat = 40.72760), zoom = 14, maptype = "roadmap", source = "google")


ggmap(map3, extent = "device") + stat_density2d(data = citi_bike_without_zero_gender, aes(x = `End Station Longitude`, y = `End Station Latitude`, fill = ..level.., alpha = ..level..), geom = "polygon", size = 0.01, bins = 16) + scale_fill_gradient(low = "red", high = "green") + facet_wrap(~Genders) + ggtitle ("End Station Traffic by Gender") + theme(axis.title.x = element_blank(), axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.title.y = element_blank(), axis.text.y = element_blank(), axis.ticks.y = element_blank(), plot.title = element_text(hjust = 0.5))

Asymmetric Traffic: Stations running out of bikes because of assymetric traffic (arrivals and departures are not equal) is a big problem for Citi Bike.

Which stations can use an increase in their bike storage capacity?

See Interactive Map Below

#creating two tables, one for frequency of departures and another for arrivals for each station.
library(plyr)
departure_freq = count(full_citi_bike_df, "`Start Station Name`") 
arrival_freq = count(full_citi_bike_df, "`End Station Name`")

names(departure_freq) <- c("Station_Names")
names(arrival_freq) <- c("Station_Names")

total <- merge(departure_freq,arrival_freq,by="Station_Names")
names(total) <- c("Station_Names", "Num_Departures", "Num_Arrivals")
total["Difference"] <- (total$Num_Arrivals - total$Num_Departures)
total <- total[1:47,] #removing NA
ordered_total <- total[order(total$Difference),]
ordered_total$Bike.Deficit <- ifelse(ordered_total$Difference<= -250, "Major Bike Deficit", ifelse(ordered_total$Difference>-250 & ordered_total$Difference<0, "Minor Bike Deficit", ifelse(ordered_total$Difference>= 0, "No Bike Deficit", NA)))

#getting longitude and latitude data into our ordered_total data frame:
names(ordered_total) <- c("Start Station Name", "Num_Departures", "Num_Arrivals", "Difference", "Bike Deficit")
long_and_lat_merged <- merge(ordered_total, full_citi_bike_df, by= "Start Station Name")
long_and_lat_merged = long_and_lat_merged[!duplicated(long_and_lat_merged$`Start Station Name`),]
long_and_lat_merged[,6:9] <- NULL
long_and_lat_merged[,8:22] <- NULL
#making an interactive map using the leaflet package
library(leaflet)
library(sp)

long_and_lat_merged$`Start Station Longitude` <- as.numeric(long_and_lat_merged$`Start Station Longitude`)
long_and_lat_merged$`Start Station Latitude` <- as.numeric(long_and_lat_merged$`Start Station Latitude`)

long_and_lat_merged.sp <- SpatialPointsDataFrame(long_and_lat_merged[,c(6,7)], long_and_lat_merged[,-c(6,7)])


pal <- colorFactor(c("red", "orange" , "green"), domain = c("Major Bike Deficit", "Minor Bike Deficit", "No Bike Deficit"))


m <- leaflet() %>% addTiles() %>% addCircleMarkers(data = long_and_lat_merged.sp, lng= long_and_lat_merged$`Start Station Longitude`, lat = long_and_lat_merged$`Start Station Latitude`, popup = paste(long_and_lat_merged$`Start Station Name`, " ", "|" , " ", "Number of Bikes:  ", long_and_lat_merged$Difference), color = ~pal(long_and_lat_merged$`Bike Deficit`)) %>% addLegend("bottomright", pal = pal, values = long_and_lat_merged$`Bike Deficit`, title = "Bike Deficits")

m

How does temperature affect citi bike rides?

There doesn’t seem to be much predictive power in temperature. For all models run there appears to be a very, very low R squared value. Given that these are commuter bikes often used to get to and from work it makes sense that a) if the weather is bad enough to warrant not using a bike then the customer would not get on the bike in the first place, so as to avoid being stuck with the bike in bad weather or weather that does not allow for biking b) if the customer is already on the bike and the weather is not ideal for biking then given that bikes must be dropped off at stations and cannot be left simply anywhere the customer would most likely complete their bike ride and c) given that these bikes are frequently used to get to and from work most customers using these bikes for this purpose can not afford to simply get off the bike in bad weather and not get to work.

weather_data = read.csv("weather_data.csv")
full_citi_bike_df$`Start Time` <- as.Date(as.POSIXct(full_citi_bike_df$`Start Time`))
weather_data$DATE <- as.Date(as.character(weather_data$DATE), "%Y%m%d")
colnames(weather_data)[3] <- "Start Time"

merged_weather_and_citi <- merge(weather_data,full_citi_bike_df,by="Start Time", check.names = FALSE)
colnames(merged_weather_and_citi)[1] <- "Date"

#Checking if any of the weather variables are significant
Overall_Fit <- lm(Distance~PRCP.x+SNWD.x+SNOW.x+TMAX.x+TMIN.x+AWND.x, data = merged_weather_and_citi)
summary(Overall_Fit)

#Checking variables that were significant in previous model (low r-squared, weak predictive power)
Temp_Precip_Fit <- lm(Distance~PRCP.x+TMAX.x+TMIN.x, data = merged_weather_and_citi)
summary(Temp_Precip_Fit)
#Checking Avg temperatures as opposed to min and max temperatures
merged_weather_and_citi["Avg_Temp"] <- (merged_weather_and_citi$TMAX.x + merged_weather_and_citi$TMIN.x)/2

#Checking just avg_temperature, whether this has predictive power for distance (low R-squared)
avg_temp_fit <- lm(Distance~Avg_Temp, data = merged_weather_and_citi)
summary(avg_temp_fit)

#Quadratic linear regression, squaring prcp for more predictive power as well (low R-squared)
temp_model_distance <- lm(Distance~scale(Avg_Temp) + I(scale(Avg_Temp)^2) + (PRCP.x)^2, data = merged_weather_and_citi)
summary(temp_model_distance)

#five_percent_data <- merged_weather_and_citi %>% sample_frac(.05) #gives you randomly selected 20 percent of the data

#Checking trip duration as opposed to distance (low R-squared)
trip_duration_avg_temp_fit <- lm(`Trip Duration`~PRCP.x+Avg_Temp, data = merged_weather_and_citi)
summary(trip_duration_avg_temp_fit)

#Quadratic regression for trip duration prediction based off of average temperature (low R-squared)
temp_model_duration <- lm(`Trip Duration`~scale(Avg_Temp) + I(scale(Avg_Temp)^2), data = merged_weather_and_citi)
summary(temp_model_duration)

Additional Findings

How does speed relate to our other variables?

Looking at just the overall mean for all customers we see a very slight increase in speed when there is precipitation as opposed to when there is no precipitation.

Next, looking at a multiple linear regression of precipitation, average temperature, gender and age group on speed we see the following relationship:

  1. For every additional inch of precipitation, speed is 0.0821226 m/s slower. Speed and precipitation levels are slightly negatively correlated.

  2. For every additional degree in temperature, speed is 0.0051149 m/s slower.

  3. The fastest age group is 31-45.

  4. When looking at interaction effects, we see the only significant one is the interaction between precipitation and average temperature. For every one degree increase in temperature, speed decreases by .005 m/s on average all else equal.

It is important to note that the R-Squared value of this model is very low, and thus only has limited predictive power, nonetheless the individual variables are significant for this model.

#Creating a speed column that is distance (in meters) divided by duration (in seconds). The final units for speed are m/s.
merged_weather_and_citi["Speed"] <- (merged_weather_and_citi$Distance/merged_weather_and_citi$`Trip Duration`)


#Subsetting data into rides that experienced precipitation versus those that did not
No_Precipitation_Data <- merged_weather_and_citi[which(merged_weather_and_citi$PRCP.x==0), ]
Precipitation_Data <- merged_weather_and_citi[which(merged_weather_and_citi$PRCP.x>0), ]

#First finding average speed across all customers when there is precipitation and then finding average speed across all customers when there is no precipitation
avg_speed_no_precipitation = mean(No_Precipitation_Data$Speed, na.rm = TRUE)
cat("The average speed of customers when there is no precipitation is:", avg_speed_no_precipitation, "m/s")
## The average speed of customers when there is no precipitation is: 2.369845 m/s
avg_speed_precipitation = mean(Precipitation_Data$Speed, na.rm = TRUE)
cat("The average speed of customers when there is precipitation is:", avg_speed_precipitation, "m/s")
## The average speed of customers when there is precipitation is: 2.419181 m/s
cor(Precipitation_Data$PRCP.x, Precipitation_Data$Speed)
## [1] -0.07140937

However, there is a slight negative correlation between precipitation and speed. Increases in precipitation may lead to slight decreases in speed.

merged_weather_and_citi$Age.Group <- as.factor(merged_weather_and_citi$Age.Group)
speed_fit <- lm(Speed ~ PRCP.x + Avg_Temp + Gender + Age.Group, data = merged_weather_and_citi)
summary(speed_fit)

#Checking if there are any intearction effects. 
speed_fit_interactions <- lm(Speed ~ PRCP.x + Avg_Temp + Gender + Age.Group + PRCP.x*Avg_Temp + PRCP.x*Age.Group + PRCP.x*Gender, data = merged_weather_and_citi)
summary(speed_fit_interactions)

Recommendations

  1. The three stations with the largest bike deficits are McGinley Square, Brunswick St, and Sip Ave. At these stations, supply is failing to meet potential demand. To realize this opportunity, I would recommend that Citi Bike reorganize their bike placements. For example, Liberty Light Rail, Lincoln Park, and Union Street have enough excess supply of bicycles to be able to transfer enough over to Sip Ave. to fully meet demand. There would be no costs associated with producing new bikes since we would be transferring bicycles that are currently in the system. We would need to only consider the overhead in transporting a large number of bikes. This model would be easier to employ in Jersey City as there are only sixteen locations with major bike deficits. However, New York City may be experiencing deficits on a much larger scale so this method should be assessed on a case by case basis.

  2. March is by far the most popular month for business. We hypothesized that the reason why users do not ride as much in the summer is because the high temperatures make biking an unviable method of transportation to and from work. I recommend that Citi Bike attempt to smooth out demand across the months by adjusting their prices during those that are less popular. By slightly reducing the price, users will be incentivized to take bikes out. However, this will only be worthwhile in the long-run if the increase in rides offsets the decrease in revenue per ride due to the price reduction.

  3. Many bike riders are loyal customers to the Citi Bike system. In the 31-45 age group, men are taking 3x the number of rides as women and the average ride lasts 8.55 minutes. I recommend introducing a loyalty program where users are given 10 minutes of free riding for every 100 minutes completed. This will work to stimulate demand amongst customers who already use the bikes. They will strive to reach these checkpoints and subsequently ride longer than 10 minutes on their next ride. Keeping track of the bikes will be easy. However, we will want to consider how we can accurately keeping track of customers’ usage. For subscribers, we could potentially look into maintaining a specifc User ID.